Feature and decision-level fusion for schizophrenia detection based on resting-state fMRI data

Mental disorders, especially schizophrenia, still pose a great challenge for diagnosis in early stages. Recently, computer-aided diagnosis techniques based on resting-state functional magnetic resonance imaging (Rs-fMRI) have been developed to tackle this challenge. In this work, we investigate different decision-level and feature-level fusion schemes for discriminating between schizophrenic and normal subjects. Four types of fMRI features are investigated, namely the regional homogeneity, voxel-mirrored homotopic connectivity, fractional amplitude of low-frequency fluctuations and amplitude of low-frequency fluctuations. Data denoising and preprocessing were first applied, followed by the feature extraction module. Four different feature selection algorithms were applied, and the best discriminative features were selected using the algorithm of feature selection via concave minimization (FSV). Support vector machine classifiers were trained and tested on the COBRE dataset formed of 70 schizophrenic subjects and 70 healthy subjects. The decision-level fusion method outperformed the single-feature-type approaches and achieved a 97.85% accuracy, a 98.33% sensitivity, and a 96.83% specificity. Moreover, the feature-fusion scheme resulted in a 98.57% accuracy, a 99.71% sensitivity, a 97.66% specificity, and an area under the ROC curve of 0.9984. In general, decision-level and feature-level fusion schemes boosted the performance of schizophrenia detectors based on fMRI features.


Introduction
Early diagnosis of mental disorders is considered a challenging task. Schizophrenia is one of these chronic mental disorders that typically appear in late adolescence or early adulthood and affect about 1% of the population around the world [1][2][3][4]. This disorder is characterized by hallucinations, delusions and negative symptoms such as social withdrawal, self neglect, etc. [5].

PLOS ONE
PLOS ONE | https://doi.org/10.1371/journal.pone.0265300 May 24, 2022

before feature selection. The different schemes are shown in Fig 1, where the colored arrows represent the flow between the different modules forming the different schemes. The red arrows trace the flow of the single-feature-type schemes, while the blue arrows are associated with the feature-level fusion schemes. For the decision-level fusion schemes, the single-feature-type schemes are combined through the orange arrows to reach the final decisions. The rest of the paper is organized as follows. Section II describes the whole system modules including dataset description, preprocessing, resting-state functional activity measures, feature selection, and pattern classification. Section III reports and discusses the experimental results. Section IV concludes the paper and gives suggestions for future work.

Dataset description
We employed a dataset created for studying the neural mechanisms of schizophrenia. This dataset was collected by the Center for Biomedical Research Excellence (COBRE) http://fcon1000.projects.nitrc.org/indi/retro/cobre.html through the Mind Research Network for Neurodiagnostic Discovery (MRN) at the University of New Mexico. The dataset was collected in accordance with the recommendations of the Declaration of Helsinki. The COBRE data acquisition protocol was approved by the Institutional Review Boards of all of the participating institutions. Informed written consent was obtained from all participants at each site. The dataset includes raw functional and anatomical MR data for 140 subjects, divided equally between healthy controls and schizophrenic patients. Diagnostic information was collected using the Structured Clinical Interview for DSM Disorders (SCID). A multi-echo MPRAGE (MEMPR) sequence was used for anatomical imaging with the following parameters: TR/TE/TI = 2530/[1.64, 3.5, 5.36, 7.22, 9.08]/900 ms, flip angle = 7°, slab thickness = 176 mm, FOV = 256 × 256 mm, data matrix = 256 × 256 × 176, number of echoes = 5, total scan time = 6 min, voxel size = 1 × 1 × 1 mm, pixel bandwidth = 650 Hz. With 5 echoes, the TI, TR and time to encode partitions for MEMPR are similar to those of conventional MPRAGE, and lead to similar GM/WM/CSF contrast. Rs-fMRI data were collected using echo planar imaging (EPI) with ramp sampling correction, using the intercommissural line (AC-PC) as a reference (TR: 2 s, TE: 29 ms, matrix size: 64 × 64, 32 slices, voxel size: 3 × 3 × 4 mm³). Rs-fMRI, anatomical MRI, and phenotypic data are recorded for every subject. A brief summary of the demographic data found in the COBRE schizophrenia dataset is shown in Table 1.

Data preprocessing
All the preprocessing steps were executed in MATLAB using the data processing & analysis for brain imaging (DPABI) software tool [63]. For each participant, slice time correction was applied for interleaved acquisition. Head motion correction based on Friston's 24-parameter motion model [64] was performed. Structural and functional images were co-registered in order to map the functional information to the anatomical space. Then, the images were spatially normalized to the Montreal Neurological Institute (MNI) standard using the DARTEL template [65] and resampled to 3 × 3 × 3 mm³. The generated images were spatially smoothed with a 4-mm full-width half-maximum (FWHM) Gaussian kernel. Moreover, the images were linearly detrended and temporally filtered by a bandpass filter (0.01-0.1 Hz) to reduce low-frequency drifts and remove physiological high-frequency noise [66].

Resting-state functional activity measures
We explain here the functional activity measures calculated over the Rs-fMRI dataset.
ALFF and fALFF features.
The ALFF [67] and fALFF [68] features measure the magnitude of low frequency fluctuations (LFFs) of the BOLD signal. The ALFF features are calculated as the average power spectrum, obtained across the range (0.01-0.1 Hz) for each voxel. The fALFF features represent the ratio of the signal power of the low-frequency range (0.01 to 0.1 Hz) to the power associated with the total detectable frequency range (0 to 0.25 Hz) [67].
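As an illustration of these two measures, the following minimal sketch computes ALFF and fALFF for a single voxel time series. It is not the DPABI implementation: the function names are hypothetical, a naive DFT is used to stay dependency-free, and the square root of the power (the amplitude) is averaged, which is a common convention for ALFF but an assumption here. The default TR of 2 s matches the acquisition described above, which gives a detectable range of 0 to 0.25 Hz.

```python
import cmath
import math


def amplitude_spectrum(signal, tr):
    """Return (frequencies in Hz, amplitudes) for the non-negative DFT bins."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset
    freqs, amps = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        freqs.append(k / (n * tr))
        amps.append(abs(coeff) / n)  # amplitude = square root of the power
    return freqs, amps


def alff_falff(signal, tr=2.0, band=(0.01, 0.1)):
    """ALFF: mean in-band amplitude. fALFF: in-band / full-range amplitude."""
    freqs, amps = amplitude_spectrum(signal, tr)
    in_band = [a for f, a in zip(freqs, amps) if band[0] <= f <= band[1]]
    alff = sum(in_band) / len(in_band)
    total = sum(amps)
    falff = sum(in_band) / total if total > 0 else 0.0
    return alff, falff


# Toy voxel: one component inside the band (0.05 Hz), one outside (0.2 Hz).
toy = [
    math.sin(2 * math.pi * 0.05 * 2.0 * t) + math.sin(2 * math.pi * 0.2 * 2.0 * t)
    for t in range(100)
]
toy_alff, toy_falff = alff_falff(toy)
```

In the toy signal, half of the spectral amplitude lies inside the band, so the fALFF value is close to 0.5, while a purely low-frequency signal would give a value close to 1.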

VMHC features.
This feature set measures the brain functional homotopy through a voxel-wise measure of connectivity between the brain hemispheres, under the assumption of synchrony in spontaneous brain activity between the homotopic regions of the two hemispheres. An approximation of the homotopic connectivity is computed between an individual voxel in one of the brain hemispheres and its symmetric counterpart in the other hemisphere, assuming morphological regularity between them. This connectivity is calculated based on the Pearson correlation coefficient between voxel pairs across the hemispheres. Then, the Pearson correlation coefficient value is converted into a Fisher z-transformed value representing the VMHC activity measures [69].
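The two steps described above (Pearson correlation followed by the Fisher z-transform) can be sketched for one pair of mirrored voxels as follows. The function names are hypothetical, and the clipping of r before the transform is an assumption added only to keep the result finite for perfectly correlated series.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def vmhc_value(left_ts, right_ts, eps=1e-7):
    """Fisher z-transformed correlation between a voxel and its mirror voxel."""
    r = pearson_r(left_ts, right_ts)
    r = max(-1.0 + eps, min(1.0 - eps, r))  # assumption: clip so atanh stays finite
    return math.atanh(r)  # Fisher z = 0.5 * ln((1 + r) / (1 - r))


# Toy time series for a voxel and its counterpart in the other hemisphere.
z = vmhc_value([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
```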

ReHo features.
This feature set measures the similarity between the time series of a particular voxel and those of its nearest neighbors [70]. To compute the similarities inside a voxel cluster, Kendall's coefficient of concordance (KCC) is applied. Here, each cluster  [71]. For each of the above-mentioned four feature types, 271633 features were originally extracted.
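For illustration, Kendall's coefficient of concordance for one voxel cluster can be computed as W = 12S / (k²(n³ − n)), where k is the number of time series in the cluster, n is the number of time points, and S is the sum of squared deviations of the per-time-point rank sums from their mean. The sketch below is an assumption-laden toy version: the function names are hypothetical, ties receive average ranks, and no tie correction is applied to W.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kendall_w(cluster_ts):
    """Kendall's W for a cluster given as a list of k time series of length n."""
    k = len(cluster_ts)
    n = len(cluster_ts[0])
    ranked = [average_ranks(ts) for ts in cluster_ts]
    rank_sums = [sum(r[t] for r in ranked) for t in range(n)]
    mean_sum = sum(rank_sums) / n
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (k ** 2 * (n ** 3 - n))


# Three perfectly concordant series give W = 1; opposed series give W = 0.
w_same = kendall_w([[1, 2, 3, 4], [10, 20, 30, 40], [0.1, 0.2, 0.3, 0.4]])
w_opposed = kendall_w([[1, 2, 3], [3, 2, 1]])
```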

Feature selection
To improve the classification performance, we examined 15 different feature selection measures [72]. Each feature selection algorithm provides a list of features ranked by feature strength or discriminability, from the most discriminating feature to the least discriminating one. Then, we followed a sequential forward selection approach. That is, we first use the most significant feature, then the two most significant features, and so on; in each case, the system is trained and its performance is evaluated on the validation dataset. The significant feature subset is thus enlarged by adding one feature at a time. This process was employed for the different feature selection algorithms. We selected the following four top-performing feature selection algorithms: feature selection via concave minimization (FSV) [73], L0-norm [74], Relief [75] and the Wilcoxon rank-sum test [76].
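The incremental search over a ranked feature list can be sketched as below. This is only an illustration of the procedure described above: `best_feature_count` is a hypothetical name, and `evaluate` stands for any user-supplied callback that trains a classifier on a feature subset and returns its validation accuracy.

```python
def best_feature_count(ranked_features, evaluate, max_features=None):
    """Grow the subset one ranked feature at a time; keep the best size.

    ranked_features: feature indices, most discriminative first.
    evaluate: hypothetical callback, feature subset -> validation accuracy.
    """
    limit = len(ranked_features)
    if max_features is not None:
        limit = min(limit, max_features)
    best_k, best_acc = 0, float("-inf")
    for k in range(1, limit + 1):
        acc = evaluate(ranked_features[:k])
        if acc > best_acc:  # ties keep the smaller subset
            best_k, best_acc = k, acc
    return best_k, best_acc


# Toy callback whose accuracy peaks when three features are used.
def toy_accuracy(subset):
    return 1 - abs(len(subset) - 3) * 0.1


k_opt, acc_opt = best_feature_count([7, 2, 9, 4, 1], toy_accuracy)
```

With hundreds of thousands of candidate features per type, a cap such as `max_features` would likely be needed in practice; the paper does not state whether one was used.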

Decision and feature fusion
Several schemes for decision-level and feature-level fusion are investigated to improve the schizophrenia detection performance. For decision-level fusion, we applied majority voting on three SVM classifiers which are based on the ReHo, VMHC, and fALFF feature types, respectively. Feature-level fusion combines different fMRI feature types to exploit the strengths of each type.
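A majority vote over the three single-feature-type decisions can be sketched as follows. The label coding (1 for schizophrenia, 0 for healthy) and the function name are assumptions made for illustration.

```python
def majority_vote(decisions):
    """Fuse per-classifier label lists (labels in {0, 1}) by majority vote.

    decisions: one list of predicted labels per classifier, all the same length.
    """
    n_classifiers = len(decisions)
    fused = []
    for votes in zip(*decisions):
        # With an odd number of classifiers, a strict majority always exists.
        fused.append(1 if sum(votes) * 2 > n_classifiers else 0)
    return fused


# Hypothetical predictions of the ReHo-, VMHC- and fALFF-based classifiers.
reho_pred = [1, 0, 1, 0]
vmhc_pred = [1, 1, 0, 0]
falff_pred = [0, 1, 1, 0]
fused_pred = majority_vote([reho_pred, vmhc_pred, falff_pred])
```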

Support vector machine (SVM) classification
A support vector machine (SVM) is a linear classifier that learns the hyperplane with the maximum possible distance (margin) to the closest training data points of either class; these closest points are the support vectors. Thus, an SVM is typically more robust compared to other classifiers. The SVM effectiveness is enhanced mainly through the employment of the kernel trick in order to handle data nonlinearity in the feature space. This trick basically transforms the data points into a higher-dimensional feature space in order to increase the linear separability between the data points. The most commonly used kernels include linear, quadratic, and radial basis function (RBF) kernels [77].
After feature selection and fusion, supervised machine learning procedures were used to discriminate the schizophrenic patients from healthy subjects. The COBRE dataset, employed in this work, was divided into training, validation and testing subsets. Training and testing of the classifiers were performed using the LIBSVM library (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). An SVM with a linear kernel and a soft-margin parameter C = 10 was employed for the classification task in all proposed schemes. To demonstrate the robustness of the system, five repetitions of nested 10-fold cross-validation were performed. The inner 10-fold cross-validation was employed only in the feature selection task, i.e. selecting the optimum number of features, while the SVM kernel and parameters were selected by trial and error.
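The index bookkeeping of one repetition of nested 10-fold cross-validation can be sketched as follows. This is a minimal illustration, not the authors' code: `select_k` and `fit_and_score` are hypothetical callbacks (the first picks the number of features from the inner splits, the second trains on the development indices and scores on the held-out test indices), and the random fold assignment is not stratified by class.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, select_k, fit_and_score, outer_k=10, inner_k=10, seed=0):
    """Run one repetition of nested cross-validation; return outer-fold scores."""
    rng = random.Random(seed)
    outer_scores = []
    for test_idx in k_fold_indices(range(n_samples), outer_k, rng):
        held_out = set(test_idx)
        dev_idx = [i for i in range(n_samples) if i not in held_out]
        # Inner loop: splits used only to choose the number of features.
        inner_splits = []
        for val_idx in k_fold_indices(dev_idx, inner_k, rng):
            val_set = set(val_idx)
            train_idx = [i for i in dev_idx if i not in val_set]
            inner_splits.append((train_idx, val_idx))
        n_features = select_k(inner_splits)
        # Outer loop: evaluate once on data never seen by the inner loop.
        outer_scores.append(fit_and_score(dev_idx, test_idx, n_features))
    return outer_scores
```

Repeating the call with five different seeds would correspond to the five repetitions mentioned above.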
For each fold in the outer loop, one-tenth of the data is randomly selected for testing and performance evaluation, while the rest is employed for training and validation in the inner 10-fold cross-validation loop. In this inner loop, one-tenth of the remaining samples is randomly selected for validation and optimization of the feature selection process, while the rest is employed for classifier training. Accuracy, sensitivity and specificity are used to evaluate the classifier performance, select the hyperparameters and verify the system robustness. These measures are computed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{3}$$

where the true positive (TP) is the number of correctly classified schizophrenia patients, the false positive (FP) is the number of healthy subjects incorrectly classified as schizophrenia patients, the true negative (TN) is the number of correctly classified healthy subjects, and the false negative (FN) is the number of schizophrenia patients incorrectly classified as healthy subjects. Moreover, the 5×2 cross-validation statistical test was employed to measure the statistical significance of the difference in accuracy between the classifier based on the ReHo activity measure and the classifiers based on other activity measures [78]. The test is carried out as follows. Let A be the ReHo-based classifier and B be the classifier based on another activity measure. The null hypothesis is that the ReHo-based classifier A has the same accuracy as the other classifier B. The alternative hypothesis is that the two classifiers have different accuracies. For each classifier, five repetitions of 2-fold cross-validation are made.
Then, for each pair of classifiers, the differences in accuracy are used to compute the following t-statistic:

$$t = \frac{p_1^{(1)}}{\sqrt{\frac{1}{5}\sum_{i=1}^{5} S_i^2}} \tag{4}$$

where
• $p_1^{(1)}$ is the difference of the classifier scores for the first fold of the first iteration,
• $S_i^2$ is the estimated variance of the score difference for the $i$th iteration (this variance is computed as $(p_i^{(1)} - \bar{p}_i)^2 + (p_i^{(2)} - \bar{p}_i)^2$),
• $p_i^{(j)}$ is the difference of the classifier scores for the $j$th fold of the $i$th iteration,
• $\bar{p}_i = (p_i^{(1)} + p_i^{(2)})/2$ is the mean score for the $i$th repetition over the two associated folds.

The t-statistic is assumed to follow a t-distribution with 5 degrees of freedom. We assume a significance level of 0.05. The corresponding threshold is $t^* = 2.57$. For any two classifiers, the null hypothesis is rejected (i.e. the difference in accuracy for the two classifiers is statistically significant) if $|t| > t^*$. Thus, an absolute t-statistic larger than $t^*$ indicates that the null hypothesis can be rejected and hence that the ReHo-based classifier accuracy is indeed different from the accuracy of the other classifier.
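The 5×2 cross-validation t-statistic described above can be computed in a few lines. In this sketch the function name and the toy accuracy differences are hypothetical; the input is assumed to be five pairs of per-fold accuracy differences between classifiers A and B.

```python
import math


def five_by_two_cv_t(diffs):
    """5x2cv paired t-statistic.

    diffs: five (p1, p2) pairs, where p1 and p2 are the accuracy differences
    (classifier A minus classifier B) on the two folds of one repetition.
    Raises ZeroDivisionError if all variance estimates are zero.
    """
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    # Numerator: difference on the first fold of the first repetition.
    return diffs[0][0] / math.sqrt(sum(variances) / len(variances))


# Hypothetical accuracy differences for five repetitions of 2-fold CV.
toy_diffs = [(0.04, 0.02), (0.03, 0.01), (0.05, 0.03), (0.02, 0.04), (0.03, 0.05)]
t_stat = five_by_two_cv_t(toy_diffs)
reject_null = abs(t_stat) > 2.571  # two-sided threshold, 5 degrees of freedom
```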

Classification outcomes
We experimented with 15 feature selection algorithms and reported the top four best performing ones as shown in Table 2.
Our experimental results show that the FSV method gives the best performance for the COBRE http://fcon1000.projects.nitrc.org/indi/retro/cobre.html dataset. The FSV algorithm lists the features according to their discriminability. In our experiments, we used the best single feature to train and test an SVM classifier and obtain the corresponding average validation accuracy over 10 folds. Then, we used the best two features to train and test an SVM classifier and obtain the corresponding average validation accuracy. At each stage of the experiments, the number of used features was increased until all of the features were used for training and validation. The number of features corresponding to the classifier with the highest average validation accuracy was selected and employed for testing in the outer loop.
We computed 95% confidence intervals for the classifier performance using the Wilson score interval method [79,80]. Among the four types of activity measures (i.e. ALFF, fALFF, ReHo and VMHC), the best average test accuracy of 94.57% is achieved by using the ReHo activity measure with 83 discriminative features as shown in Table 3.
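For reference, the Wilson score interval for a proportion such as accuracy can be sketched as follows. The function name and the example counts are hypothetical and do not reproduce the values in Table 3; z = 1.96 corresponds to a 95% confidence level.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 132 of 140 subjects classified correctly.
low, high = wilson_interval(132, 140)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] even when the observed accuracy is close to 100%.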
We investigated the fusion of decisions and features based on Rs-fMRI activity in order to improve the classification performance. The decision-fusion scheme, shown in Fig 2, computes the ReHo, VMHC, and fALFF features, classifies each feature type using an SVM, and fuses the three decisions by a majority vote to get the final decision. Table 3 shows the sensitivity, specificity, and accuracy measures on the test set for each feature type, together with the results of the decision-level fusion scheme of Fig 2. Our accuracy based on fALFF (92.71%), in Table 3, is considerably higher than the 75% accuracy reported by Guo et al. [56]. Moreover, we achieved better results than those obtained by Chyzhyk et al. [57] based on the four feature types, as shown in Table 3.
Furthermore, we investigated feature-level fusion schemes. Table 4 shows schizophrenia detection results with different pairwise, triple, and quadruple combinations of the ALFF, fALFF, ReHo and VMHC feature types. Feature selection was applied after combining the features, and the number of features associated with the minimum validation error was selected for each combination. The pairwise combination of the ALFF and fALFF feature types resulted in the best accuracy (97.71%), specificity (97.80%), and sensitivity (98.80%) among the pairwise combinations. The ALFF, fALFF, and VMHC combination achieves the best performance among all triple combinations. In addition, feature-level fusion of the four feature types clearly leads to the best overall performance metrics, with an accuracy of 98.71% and a sensitivity of 99.71%.

Statistical significance testing
In order to test the statistical significance of the performance differences between the classifiers listed in Tables 3 and 4, we computed t-statistics based on Eq (4). The classifier based on the ReHo activity measure was employed as the reference algorithm since it provided the highest accuracy in the case of single-feature-type classifiers. The resulting p-values for all significance tests associated with single-feature-type classifiers and the decision-level classifier are listed in Table 5. On the one hand, the statistical results show that the difference in accuracy between the ReHo-based and VMHC-based classifiers is not statistically significant. On the other hand, the differences in accuracy between the ReHo-based classifier and the fALFF-based classifier, the ALFF-based classifier, and the decision-level fusion classifier are statistically significant. For the feature fusion classifiers, the p-values listed in Table 6 show statistically significant improvements for all classifiers (except for two classifiers: the ALFF-ReHo-VMHC and fALFF-ReHo-VMHC classifiers) over the ReHo-based classifier.

ROC analysis
Receiver operating characteristic (ROC) curves were generated for each of the 10 folds of each experiment. An average ROC curve can be obtained by projecting curves from two-dimensional space onto a single dimension and averaging them traditionally. However, this projection raises questions of appropriateness and conservation of the characteristics of interest. Vertical averaging [81] is employed to plot the ROC curves, where the FP rates are fixed and the corresponding TP rates are averaged. We report the results of the vertical averaging method, which achieved better performance than the threshold averaging method. Fig 3(a)-3(c) shows the results based on vertical averaging for single, pairwise, triple and quadruple combinations of features, respectively.
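Vertical averaging can be sketched as follows: each fold's ROC curve is sampled on a fixed grid of FP rates, and the TP rates at each grid point are averaged. The function names, the grid size, and the use of linear interpolation between ROC points are assumptions made for this illustration.

```python
def tpr_at(curve, fpr):
    """TP rate of one ROC curve at a given FP rate.

    curve: list of (fpr, tpr) points sorted by increasing fpr.
    Uses linear interpolation and the highest value on vertical segments.
    """
    best = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fpr <= x1:
            if x1 == x0:
                value = max(y0, y1)
            else:
                value = y0 + (y1 - y0) * (fpr - x0) / (x1 - x0)
            best = max(best, value)
    return best


def vertical_average(curves, n_points=11):
    """Average several ROC curves at fixed FP rates."""
    grid = [i / (n_points - 1) for i in range(n_points)]
    mean_tpr = [sum(tpr_at(c, f) for c in curves) / len(curves) for f in grid]
    return grid, mean_tpr


def auc(fprs, tprs):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for x0, x1, y0, y1 in zip(fprs, fprs[1:], tprs, tprs[1:])
    )


# Toy example: average a perfect ROC curve with a chance-level one.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
chance = [(0.0, 0.0), (1.0, 1.0)]
grid, mean_tpr = vertical_average([perfect, chance], n_points=3)
mean_auc = auc(grid, mean_tpr)
```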

Discriminative feature mapping
The discriminative ALFF, fALFF, ReHo and VMHC connectivity maps were constructed using a two-sample t-test to statistically verify the significance of the difference between healthy subjects and schizophrenic patients. The significance level was set at p < 0.05, corrected for multiple testing using the false discovery rate (FDR) method [82] (min z > 2.3, cluster significance: p < 0.05). The most discriminative features for classification are shown in Table 7. Moreover, these regions are highlighted using BrainNet Viewer [83].
In comparison to healthy controls, the schizophrenic patients showed significant ReHo increases in the right LING and the right PCUN, ALFF increases in the right PHG and the left PreCG, fALFF increases in VER10 and the left ORBmid, and VMHC increases in the left PHG and the left ACG. Schizophrenic patients also showed significant ReHo decreases in the left PCUN and the right ACG, ALFF decreases in the right CERcr2 and the right TPOmid, fALFF decreases in the right CERcr2 and the right SPG, and VMHC decreases in the right CER7 and the left PoCG.

Robustness to noise
We performed some additional experiments to investigate the robustness of the proposed method with different feature combinations. In particular, we added Rician noise with two different levels of σ = 1 and σ = 2 to the test data. The performance outcomes under these noise conditions are summarized in Tables 8 and 9 for the single-feature-type classifiers and fused-feature classifiers, respectively. Moreover, the performance outcomes under the noise conditions are visualized for the best classifiers with single, pairwise, triple, and quadruple feature combinations as well as the decision-fusion classifier in Fig 5. For the best single-feature-type classifier, namely the ReHo-based classifier, the 95.57% detection accuracy dropped by 2.11% and 6.11% with the two noise levels, respectively. For the decision-level fusion classifier, the 97.85% detection accuracy dropped by 2.07% and 9.31% with the two noise levels, respectively. For the best pairwise-feature-type classifier, i.e. the classifier based on the ALFF and fALFF features, the 97.71% detection accuracy dropped by 1.85% and 10% with the two noise levels, respectively. For the best triple-feature-type classifier, that is the one based on the ALFF-fALFF-VMHC feature combination, the 97.85% detection accuracy dropped by 2.13% and 9.92% with the two noise levels, respectively. Finally, for the classifier with the quadruple feature combination, the 98.71% detection accuracy dropped by 2.50% and 8% with the two noise levels, respectively. These results show that all classifiers essentially
Table 8. Effects of Rician noise on schizophrenia detection performance with single-feature-type and decision-level fusion schemes for fMRI data contaminated with Rician noise levels of σ = 1 and σ = 2.
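The Rician noise contamination used in this robustness experiment can be illustrated with the sketch below, which models each noisy value as the magnitude of a complex signal whose real and imaginary parts carry independent Gaussian noise of standard deviation σ. The function name is hypothetical, and how the noise level relates to the intensity scale of the actual test images is not specified here, so the toy intensities are assumptions.

```python
import math
import random


def add_rician_noise(values, sigma, rng=None):
    """Return Rician-corrupted copies of non-negative magnitude values.

    Each output is sqrt((v + n1)^2 + n2^2) with n1, n2 ~ N(0, sigma^2).
    """
    rng = rng or random.Random()
    return [
        math.hypot(v + rng.gauss(0, sigma), rng.gauss(0, sigma))
        for v in values
    ]


# Toy intensities corrupted at the two noise levels considered above.
clean = [100.0] * 1000
noisy_sigma1 = add_rician_noise(clean, 1.0, random.Random(0))
noisy_sigma2 = add_rician_noise(clean, 2.0, random.Random(0))
```

Because the output is a magnitude, it is never negative, and for low-intensity voxels the noise introduces a positive bias rather than a zero-mean perturbation.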

Discussion
In this paper, we investigated different fusion schemes of resting-state functional activity measures for discriminating between schizophrenic and healthy subjects. The improvements in classification performance can be ascribed to the choice of the feature selection and fusion approaches. For the decision fusion scheme in Table 3, the results outperformed those of the single-feature-type schemes. This shows that the fusion of the decisions of weak classifiers leads to more accurate classification performance [84]. Further, the feature fusion schemes in Table 4 show even better performance enhancements. This additional improvement can be ascribed to the fact that the feature-fusion schemes combine and optimize the selection of features, while the decision-level fusion scheme classifies samples by merely conducting a vote among only three single-feature-type classifiers. We can understand the improvements obtained by different variants of the feature fusion classifiers by looking at the contributions of the single feature types to each of these classifiers. Specifically, Fig 6 shows the selected percentages of each of the ALFF, fALFF, ReHo, and VMHC feature types for optimizing the performance of the classifiers with pairwise, triple, and quadruple feature combinations. Obviously, each combination has different selections of individual feature types. For example, for pairwise combinations, the ALFF feature type is dominated by the other features; the fALFF type is dominated by the ReHo and VMHC ones; and the ReHo type is slightly dominated by the VMHC one. For triple combinations, the ALFF type consistently shows zero or marginal contributions, while the fALFF features are generally dominated by the ReHo and VMHC features. For the quadruple feature combination associated with the best performance, the same pattern is observed where the ALFF type has no contribution, while the VMHC type has the highest contribution followed by the fALFF and ReHo types.
To put our results in context with other relevant studies, we examined the schizophrenia detection results in some of these studies. Firstly, the high performance metrics of our method agree with the high accuracies reported by other studies on the COBRE dataset. For example, Qureshi et al. [85] achieved an accuracy of 100% on the COBRE dataset with 10-fold cross-validation scheme and extreme learning machines (ELM). Also, Chyzhyk et al. [86] attained a 100% accuracy in the classification of healthy subjects and schizophrenic patients with and without auditory hallucinations. Juneja et al. [87] obtained a 98% accuracy on a multisite dataset from the Function Biomedical Informatics Research Network (FBIRN). Other schizophrenia detection methods achieved lower accuracies ranging from 62% to 94% for different variations of the schizophrenia classes, the image modalities and the collected datasets [88].
The vertical averaging method was employed to plot the ROC curves as shown in Fig 3. We noticed that the AUC values of the combinations of features are slightly better than those based on single feature types. The significant regions that are shown in Table 7 are in agreement with the previous findings. Studies based on ReHo features reported an increase in ReHo values in the right LING, the right PCUN and a decrease in ReHo values in the left PCUN and the right ACG [89,90]. ALFF-based studies reported significant ALFF increases in the right PHG and decreased ALFF values in the right PHG and the left PreCG [91]. Also, fALFF-based studies demonstrated significant fALFF increases in VER10 and the left ORBmid and decreased fALFF values in the right CERcr2 and the right SPG [91,92]. Finally, VMHC-based studies reported significant VMHC increases in the left PHG and the left ACG and decreased VMHC values in the right CER7 and the left PoCG [93]. As shown in Table 3, the standard deviation of the accuracy results for the outer 10-fold cross-validation scheme is considered small, and this demonstrates the low bias and high robustness of the proposed system for schizophrenia diagnosis. Moreover, the extracted features can be employed as biological markers that may help identify subjects at increased risk of disease development, and hence improve disease prognosis. Furthermore, the localization of the affected regions, in Table 7, can be employed in further research to localize and understand how the different regions are affected and changed through the progression of the disease.

Conclusion
In this paper, we introduced different feature-level and decision-level fusion schemes for discriminating between schizophrenic and healthy subjects and identifying schizophrenia-affected brain regions using whole-brain Rs-fMRI analysis. The highest average test accuracy of 98.71% was obtained with feature fusion of the four types of features considered in this paper. In summary, our work employs optimized feature selection algorithms, explores different fusion schemes, and exploits a large-scale dataset for schizophrenia detection. For future work, we seek to address this detection problem with deep learning and graph-theoretic techniques for two main reasons. Firstly, in practical applications for schizophrenia detection, the actual test data may be contaminated by noise and artifacts. Under these conditions, the handcrafted features may not be sufficiently robust. This is why other graph-theoretic or deep learning methods with better noise robustness should be sought. Secondly, although the employed COBRE dataset is considered big compared to the other available datasets, a larger dataset is still needed with a few hundred MRI volumes, acquired from different centers, with different MRI machines, and with different specifications. With these large real-world data variations, our current handcrafted-feature method may not achieve the same performance outcomes, and hence deep learning architectures would be needed to effectively capture the increasing data complexity.